You are currently looking at version 1.2 of this notebook. To download notebooks and datafiles, as well as get help on Jupyter notebooks in the Coursera platform, visit the Jupyter Notebook FAQ course resource.


Assignment 2 - Pandas Introduction

All questions are weighted the same in this assignment.

Part 1

The following code loads the olympics dataset (olympics.csv), which was derrived from the Wikipedia entry on All Time Olympic Games Medals, and does some basic data cleaning.

The columns are organized as # of Summer games, Summer medals, # of Winter games, Winter medals, total # number of games, total # of medals. Use this dataset to answer the questions below.


In [2]:
import pandas as pd

df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)

for col in df.columns:
    if col[:2]=='01':
        df.rename(columns={col:'Gold'+col[4:]}, inplace=True)
    if col[:2]=='02':
        df.rename(columns={col:'Silver'+col[4:]}, inplace=True)
    if col[:2]=='03':
        df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)
    if col[:1]=='№':
        df.rename(columns={col:'#'+col[1:]}, inplace=True)

names_ids = df.index.str.split('\s\(') # split the index by '('

df.index = names_ids.str[0] # the [0] element is the country name (new index) 
df['ID'] = names_ids.str[1].str[:3] # the [1] element is the abbreviation or ID (take first 3 characters from that)

df = df.drop('Totals')
df.head()


Out[2]:
# Summer Gold Silver Bronze Total # Winter Gold.1 Silver.1 Bronze.1 Total.1 # Games Gold.2 Silver.2 Bronze.2 Combined total ID
Afghanistan 13 0 0 2 2 0 0 0 0 0 13 0 0 2 2 AFG
Algeria 12 5 2 8 15 3 0 0 0 0 15 5 2 8 15 ALG
Argentina 23 18 24 28 70 18 0 0 0 0 41 18 24 28 70 ARG
Armenia 5 1 2 9 12 6 0 0 0 0 11 1 2 9 12 ARM
Australasia 2 3 4 5 12 0 0 0 0 0 2 3 4 5 12 ANZ

Question 0 (Example)

What is the first country in df?

This function should return a Series.


In [3]:
# You should write your whole answer within the function provided. The autograder will call
# this function and compare the return value against the correct solution value
def answer_zero():
    # This function returns the row for Afghanistan, which is a Series object. The assignment
    # question description will tell you the general format the autograder is expecting
    return df.iloc[0]

# You can examine what your function returns by calling it in the cell. If you have questions
# about the assignment formats, check out the discussion forums for any FAQs
answer_zero()


Out[3]:
# Summer           13
Gold                0
Silver              0
Bronze              2
Total               2
# Winter            0
Gold.1              0
Silver.1            0
Bronze.1            0
Total.1             0
# Games            13
Gold.2              0
Silver.2            0
Bronze.2            2
Combined total      2
ID                AFG
Name: Afghanistan, dtype: object

Question 1

Which country has won the most gold medals in summer games?

This function should return a single string value.


In [14]:
def answer_one():
    maxim = df['Gold'].max()
    most_gold = df.index[df['Gold'] == maxim]
    return most_gold[0]
answer_one()


Out[14]:
'United States'

Question 2

Which country had the biggest difference between their summer and winter gold medal counts?

This function should return a single string value.


In [20]:
def answer_two():
    gold_sum_win_diff = abs(df['Gold'] - df['Gold.1'])
    biggest_idx = gold_sum_win_diff.idxmax()
    
    return biggest_idx

answer_two()


Out[20]:
'United States'

Question 3

Which country has the biggest difference between their summer gold medal counts and winter gold medal counts relative to their total gold medal count?

$$\frac{Summer~Gold - Winter~Gold}{Total~Gold}$$

Only include countries that have won at least 1 gold in both summer and winter.

This function should return a single string value.


In [21]:
def answer_three():
    new_df = df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]
    summer_gold = new_df['Gold']
    wint_gold = new_df['Gold.1']
    tot_gold = new_df['Combined total']
    measure = (summer_gold - wint_gold)/tot_gold
    return measure.idxmax()

answer_three()


Out[21]:
'China'

In [22]:
df.shape


Out[22]:
(146, 16)

Question 4

Write a function that creates a Series called "Points" which is a weighted value where each gold medal (Gold.2) counts for 3 points, silver medals (Silver.2) for 2 points, and bronze medals (Bronze.2) for 1 point. The function should return only the column (a Series object) which you created.

This function should return a Series named Points of length 146


In [29]:
def answer_four():
    gold_2_val = df['Gold.2']*3
    silver_2_val = df['Silver.2']*2
    bronze_2_val = df['Bronze.2']*1
    
    Points = pd.Series(gold_2_val+silver_2_val+bronze_2_val, name='Points')
    return Points

#answer_four()

Part 2

For the next set of questions, we will be using census data from the United States Census Bureau. Counties are political and geographic subdivisions of states in the United States. This dataset contains population data for counties and states in the US from 2010 to 2015. See this document for a description of the variable names.

The census dataset (census.csv) should be loaded as census_df. Answer questions using this as appropriate.

Question 5

Which state has the most counties in it? (hint: consider the sumlevel key carefully! You'll need this for future questions too...)

This function should return a single string value.


In [31]:
census_df = pd.read_csv('census.csv', encoding='iso-8859-1')
census_df.head()


Out[31]:
SUMLEV REGION DIVISION STATE COUNTY STNAME CTYNAME CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010 ... RDOMESTICMIG2011 RDOMESTICMIG2012 RDOMESTICMIG2013 RDOMESTICMIG2014 RDOMESTICMIG2015 RNETMIG2011 RNETMIG2012 RNETMIG2013 RNETMIG2014 RNETMIG2015
0 40 3 6 1 0 Alabama Alabama 4779736 4780127 4785161 ... 0.002295 -0.193196 0.381066 0.582002 -0.467369 1.030015 0.826644 1.383282 1.724718 0.712594
1 50 3 6 1 1 Alabama Autauga County 54571 54571 54660 ... 7.242091 -2.915927 -3.012349 2.265971 -2.530799 7.606016 -2.626146 -2.722002 2.592270 -2.187333
2 50 3 6 1 3 Alabama Baldwin County 182265 182265 183193 ... 14.832960 17.647293 21.845705 19.243287 17.197872 15.844176 18.559627 22.727626 20.317142 18.293499
3 50 3 6 1 5 Alabama Barbour County 27457 27457 27341 ... -4.728132 -2.500690 -7.056824 -3.904217 -10.543299 -4.874741 -2.758113 -7.167664 -3.978583 -10.543299
4 50 3 6 1 7 Alabama Bibb County 22915 22919 22861 ... -5.527043 -5.068871 -6.201001 -0.177537 0.177258 -5.088389 -4.363636 -5.403729 0.754533 1.107861

5 rows × 100 columns


In [82]:
df_level = census_df[census_df['SUMLEV'] == 50]
group = df_level.groupby(['STNAME']).size()
#group

In [ ]:


In [66]:
def answer_five():
    df_level = census_df[census_df['SUMLEV'] == 50]
    group = df_level.groupby(['STNAME']).size().reset_index(name='freq')
    
    max_count_state = group['freq'].idxmax()
    return group['STNAME'][max_count_state]

answer_five()


Out[66]:
'Texas'

In [91]:
df_level = census_df[census_df['SUMLEV'] == 50]
top_count = df_level.sort_values(by=['STNAME', 'CENSUS2010POP'], ascending=False)
top_count = top_count.groupby('STNAME').head(3)

maxim = top_count.groupby('')
top_count


Out[91]:
SUMLEV REGION DIVISION STATE COUNTY STNAME CTYNAME CENSUS2010POP ESTIMATESBASE2010 POPESTIMATE2010 ... RDOMESTICMIG2011 RDOMESTICMIG2012 RDOMESTICMIG2013 RDOMESTICMIG2014 RDOMESTICMIG2015 RNETMIG2011 RNETMIG2012 RNETMIG2013 RNETMIG2014 RNETMIG2015
3180 50 4 8 56 21 Wyoming Laramie County 91738 91881 92271 ... -1.200428 15.547274 4.787847 -1.226133 0.278940 -0.973320 17.914554 6.003143 -0.207819 1.673640
3182 50 4 8 56 25 Wyoming Natrona County 75450 75450 75472 ... 7.189319 23.066162 24.322042 -0.958472 -0.061057 7.689674 23.749508 25.085233 -0.110593 0.793743
3172 50 4 8 56 5 Wyoming Campbell County 46133 46133 46244 ... -2.843479 15.601020 -5.895711 -8.550911 10.916963 -2.649606 15.558684 -5.916543 -8.509402 10.978525
3137 50 2 3 55 79 Wisconsin Milwaukee County 947735 947736 948301 ... -4.417899 -5.257085 -5.581617 -8.303054 -8.604921 -2.352634 -2.953702 -3.114494 -5.344614 -5.623163
3109 50 2 3 55 25 Wisconsin Dane County 488073 488075 489187 ... 5.468267 4.529574 3.392786 3.107322 2.481328 8.430668 7.729273 6.874337 7.038202 6.383028
3164 50 2 3 55 133 Wisconsin Waukesha County 389891 389938 390076 ... -0.765799 2.128860 0.038132 0.760109 -0.719858 0.102448 3.180527 1.189727 2.077633 0.593567
3060 50 3 5 54 39 West Virginia Kanawha County 193063 193058 192939 ... -3.486747 0.489521 -2.425191 -4.738889 -8.824703 -3.128199 1.124856 -1.861921 -4.094107 -8.164171
3042 50 3 5 54 3 West Virginia Berkeley County 104169 104172 104646 ... 4.906062 7.784296 8.403712 11.389012 8.385254 5.600137 8.574007 9.127210 12.220127 9.203986
3046 50 3 5 54 11 West Virginia Cabell County 96319 96316 96369 ... 1.689758 1.002791 0.092837 -3.322773 -2.055031 2.716053 2.243358 1.402865 -1.795536 -0.536993
3017 50 4 9 53 33 Washington King County 1931249 1931256 1937699 ... 5.141363 5.472471 5.662073 3.337122 2.840873 11.616146 12.013317 12.704538 11.206066 10.589099
3027 50 4 9 53 53 Washington Pierce County 795225 795229 795422 ... 1.940276 0.261251 -0.170414 5.168140 5.723786 3.714170 3.765229 2.480196 7.920039 8.699104
3031 50 4 9 53 61 Washington Snohomish County 713335 713330 715515 ... 0.605003 5.006397 8.056249 7.957886 6.603487 3.660617 8.417289 11.496756 11.768105 10.410479
2895 50 3 5 51 59 Virginia Fairfax County 1081726 1081685 1086835 ... -1.068978 -4.301956 -5.331461 -15.828378 -14.700525 7.703572 4.663819 3.910625 -5.532061 -4.358958
2996 50 3 5 51 810 Virginia Virginia Beach city 437994 437966 439038 ... -1.147374 -7.624586 -3.757063 -7.275564 -7.954636 1.693851 -1.838632 0.514054 -2.853337 -2.954769
2939 50 3 5 51 153 Virginia Prince William County 402002 401972 406371 ... 14.812914 7.694723 2.839642 -3.671194 -2.266019 20.051539 13.319960 8.118846 2.139833 3.637673
2855 50 1 1 50 7 Vermont Chittenden County 156545 156540 156805 ... -0.044508 -0.031596 0.451860 -1.652769 -3.669383 2.549666 2.932088 3.577223 1.927191 -0.068296
2862 50 1 1 50 21 Vermont Rutland County 61642 61646 61585 ... -3.467615 -4.976224 -3.094650 -5.714996 -5.655373 -3.272257 -4.517887 -2.600823 -5.052387 -4.988072
2863 50 1 1 50 23 Vermont Washington County 59534 59526 59570 ... -1.578161 -5.617505 -3.407759 -4.097008 -6.278073 -1.376694 -5.281128 -3.087227 -3.741482 -5.920784
2839 50 4 8 49 35 Utah Salt Lake County 1029655 1029581 1033182 ... 0.521647 1.455453 0.943194 -2.857349 -0.895372 3.354680 4.171416 3.911645 0.519769 2.462501
2846 50 4 8 49 49 Utah Utah County 516564 516640 520011 ... -0.799582 -1.173132 2.685569 -3.000162 4.423179 0.795774 0.325040 4.314850 -1.132916 6.270569
2827 50 4 8 49 11 Utah Davis County 306479 306486 307929 ... -1.554701 -2.339717 5.981164 6.635558 4.103723 -0.438671 -0.646208 7.499147 8.238512 5.765039
2667 50 3 7 48 201 Texas Harris County 4092459 4093076 4108187 ... 1.403957 3.518819 4.820913 5.243613 3.794291 7.154417 8.915715 10.683269 11.850629 10.303814
2623 50 3 7 48 113 Texas Dallas County 2368139 2367643 2373464 ... 0.646423 4.853046 -2.117265 -0.754784 -0.981666 5.233271 9.100284 2.574144 4.639783 4.397784
2786 50 3 7 48 439 Texas Tarrant County 1809034 1809531 1816528 ... 5.092591 6.491618 4.793224 5.070751 6.024164 8.010830 9.415741 7.937773 8.697380 9.639477
2549 50 3 6 47 157 Tennessee Shelby County 927644 927640 928618 ... -2.655000 -2.027545 -8.490770 -8.812882 -8.281490 -1.164248 -0.393978 -6.777925 -6.832566 -6.255349
2489 50 3 6 47 37 Tennessee Davidson County 626681 626662 628131 ... 1.147215 9.306546 4.034396 2.611925 2.985201 5.180667 13.245487 8.284266 7.391673 7.719682
2517 50 3 6 47 93 Tennessee Knox County 432226 432234 433035 ... 4.535601 6.348939 2.846599 6.259052 1.875679 5.740801 7.568135 4.120791 7.733088 3.362443
2452 50 2 4 46 99 South Dakota Minnehaha County 169468 169476 170071 ... -0.806859 7.667710 9.812071 4.352593 0.293601 2.812315 11.098607 13.278139 8.246728 4.104979
2455 50 2 4 46 103 South Dakota Pennington County 100948 100937 101300 ... 3.975499 11.718108 9.222836 8.486537 -4.240042 4.289613 13.263778 10.104361 9.315627 -3.281424
2444 50 2 4 46 83 South Dakota Lincoln County 44828 44823 45177 ... 21.858518 19.072051 19.974142 18.716288 14.429987 22.097765 19.261300 20.177752 18.933231 14.640783
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
464 50 3 5 13 135 Georgia Gwinnett County 805321 805324 808274 ... 5.191468 4.425225 8.411919 8.649629 6.821351 9.786512 8.450342 12.819901 13.668649 11.767366
441 50 3 5 13 89 Georgia DeKalb County 691893 691891 692485 ... -8.622295 -0.980994 -5.923632 -4.642907 -0.984173 -2.312702 5.332913 0.928068 3.106845 6.675380
372 50 3 5 12 86 Florida Miami-Dade County 2496435 2498017 2508171 ... 7.464259 -6.908149 -8.130040 -10.464276 -12.205475 23.110902 6.891198 6.633403 6.087936 4.154777
335 50 3 5 12 11 Florida Broward County 1748066 1748148 1753263 ... 7.334407 5.473544 1.555403 1.178479 1.072196 15.871353 13.173887 9.767689 10.436646 10.182937
379 50 3 5 12 99 Florida Palm Beach County 1320134 1320134 1324171 ... 4.736597 8.675158 5.992198 8.893634 9.957796 10.665231 14.033170 11.752520 15.407606 16.364047
328 50 3 5 11 1 District of Columbia District of Columbia 601723 601767 605126 ... 11.560071 10.052444 9.457678 1.480094 5.601833 17.028422 15.972111 15.634120 8.378037 12.434838
325 50 3 5 10 3 Delaware New Castle County 538479 538477 538911 ... -0.529012 -0.503417 -1.522259 -0.039896 -1.843034 2.319516 2.603439 1.788746 3.761076 1.929511
326 50 3 5 10 5 Delaware Sussex County 197145 197110 197890 ... 9.201060 12.056432 14.219606 16.909226 21.384857 11.154778 13.914698 16.152652 19.181043 23.621826
324 50 3 5 10 1 Delaware Kent County 162310 162349 162990 ... 6.373541 6.735706 4.991157 7.304533 3.984871 8.195423 9.487672 7.020855 9.368093 6.197402
315 50 1 1 9 1 Connecticut Fairfield County 916829 916850 919744 ... -0.228261 -3.030708 -2.070291 -6.633789 -7.385939 6.119788 2.687406 4.096896 0.441829 -0.338989
316 50 1 1 9 3 Connecticut Hartford County 894014 894029 895303 ... -5.037297 -6.324143 -5.237664 -8.636074 -8.135161 -0.025666 -1.662136 -0.314193 -2.840433 -2.327200
319 50 1 1 9 9 Connecticut New Haven County 862477 862474 863401 ... -5.240312 -5.579033 -6.802453 -7.571850 -8.434900 -1.074461 -1.736214 -2.755967 -2.767218 -3.620603
271 50 4 8 8 41 Colorado El Paso County 622263 622261 626916 ... 5.840828 0.974565 3.334799 1.508319 5.472036 7.208065 5.279801 6.136092 4.122838 8.714281
266 50 4 8 8 31 Colorado Denver County 600158 599860 603300 ... 14.173852 11.561016 10.981084 11.917335 15.634515 17.098589 14.624806 14.359160 15.751607 19.438429
252 50 4 8 8 5 Colorado Arapahoe County 572003 572155 574727 ... 7.756987 6.256681 8.577457 6.718039 7.128637 11.252972 10.087820 12.627969 11.245590 11.633206
209 50 4 9 6 37 California Los Angeles County 9818605 9818700 9826009 ... -4.999439 -4.071266 -4.332318 -5.394299 -6.027694 -0.007808 0.377510 0.578049 0.215038 -0.437481
227 50 4 9 6 73 California San Diego County 3095313 3095308 3104386 ... 0.354751 0.235513 -0.246567 0.657077 -2.855045 4.382515 5.773072 4.745940 5.921091 2.770965
220 50 4 9 6 59 California Orange County 3010232 3010266 3017866 ... 1.992772 0.618346 -0.491825 -2.374727 -3.182714 6.033306 4.395138 3.681764 2.407607 1.587398
174 50 3 7 5 119 Arkansas Pulaski County 382748 382789 383628 ... 1.085073 -2.101027 -0.709951 -3.577252 -6.701870 2.421945 -0.157255 1.025199 -1.696771 -4.750660
118 50 3 7 5 7 Arkansas Benton County 221339 221344 223010 ... 11.164115 11.786353 9.333699 11.268649 16.254517 13.816535 14.533319 12.296171 14.648828 19.555772
186 50 3 7 5 143 Arkansas Washington County 203065 203060 204026 ... 5.763423 5.801883 10.908233 7.168967 9.211963 9.191373 8.872065 14.267128 10.975336 12.950540
106 50 4 8 4 13 Arizona Maricopa County 3817117 3817357 3825597 ... 2.444413 9.232973 7.879873 8.318210 9.123309 4.726436 11.527593 10.390177 11.203958 11.985998
109 50 4 8 4 19 Arizona Pima County 980263 980263 981870 ... 0.874159 -0.248360 -1.381638 1.036139 0.159859 2.781876 2.000003 0.854104 3.582023 2.777186
110 50 4 8 4 21 Arizona Pinal County 375770 375770 379384 ... -13.828377 4.308298 0.966512 17.437436 18.998761 -11.594501 6.465078 3.391910 20.212795 21.745600
71 50 4 9 2 20 Alaska Anchorage Municipality 291826 291826 293405 ... -3.788123 -7.563712 -3.962343 -18.897015 -19.781254 -0.969922 -2.652342 0.053320 -14.789980 -15.280810
76 50 4 9 2 90 Alaska Fairbanks North Star Borough 97581 97581 98289 ... -17.339069 5.430293 -9.147023 -33.436804 -13.859157 -15.791483 10.618791 -4.926084 -29.823633 -8.984834
85 50 4 9 2 170 Alaska Matanuska-Susitna Borough 88995 88995 89761 ... 13.050517 11.474315 12.906763 15.008086 18.697964 13.391923 12.303913 13.528903 15.605525 19.349985
37 50 3 6 1 73 Alabama Jefferson County 658466 658350 658116 ... -4.138628 -4.492646 -1.664873 -0.970225 -3.337536 -2.985041 -3.536344 -0.647113 0.221333 -2.126089
49 50 3 6 1 97 Alabama Mobile County 412992 413143 413332 ... -5.242052 -3.722733 -3.181655 -2.311087 -3.873201 -3.751237 -2.281051 -1.753896 -0.759909 -2.288272
45 50 3 6 1 89 Alabama Madison County 334811 334811 336138 ... 4.170416 3.438521 6.367708 4.064107 2.480087 5.739131 4.880766 7.818211 5.734524 4.180881

151 rows × 100 columns

Question 6

Only looking at the three most populous counties for each state, what are the three most populous states (in order of highest population to lowest population)? Use CENSUS2010POP.

This function should return a list of string values.


In [ ]:
def answer_six():
    df_level = census_df[census_df['SUMLEV'] == 50]
    group = df_level.groupby(['STNAME']).size()

    return "YOUR ANSWER HERE"

Question 7

Which county has had the largest absolute change in population within the period 2010-2015? (Hint: population values are stored in columns POPESTIMATE2010 through POPESTIMATE2015, you need to consider all six columns.)

e.g. If County Population in the 5 year period is 100, 120, 80, 105, 100, 130, then its largest change in the period would be |130-80| = 50.

This function should return a single string value.


In [ ]:
def answer_seven():
    return "YOUR ANSWER HERE"

Question 8

In this datafile, the United States is broken up into four regions using the "REGION" column.

Create a query that finds the counties that belong to regions 1 or 2, whose name starts with 'Washington', and whose POPESTIMATE2015 was greater than their POPESTIMATE 2014.

This function should return a 5x2 DataFrame with the columns = ['STNAME', 'CTYNAME'] and the same index ID as the census_df (sorted ascending by index).


In [ ]:
def answer_eight():
    return "YOUR ANSWER HERE"